Explicit formulas for efficient multiplication in $\mathbb{F}_{3^{6m}}$



Elisa Gorla¹, Christoph Puttmann², and Jamshid Shokrollahi³

¹ University of Zurich, Switzerland
elisa.gorla@math.unizh.ch
² Heinz Nixdorf Institute, University of Paderborn, Germany
puttmann@hni.upb.de
³ Ruhr University, Bochum, Germany
jamshid@crypto.rub.de



Abstract. Efficient computation of the Tate pairing is an important part of pairing-based cryptography. Recently, with the introduction of the Duursma-Lee method, special attention has been given to fields of characteristic 3. In particular, multiplication in $\mathbb{F}_{3^{6m}}$, where $m$ is prime, is an important operation in that method. In this paper we propose a new method that reduces the number of $\mathbb{F}_{3^m}$-multiplications needed for a multiplication in $\mathbb{F}_{3^{6m}}$ from 18 in recent implementations to 15. The method is based on the fast Fourier transform, and we give its explicit formulas. The execution times of our software implementations for $\mathbb{F}_{3^{6m}}$ show the efficiency of our results.



Keywords: Finite field arithmetic, fast Fourier transform, Lagrange interpolation, Tate pairing computation

1 Introduction 

Efficient multiplication in finite fields is a central task in the implementation of most public key cryptosystems. A great amount of work has been devoted to this topic (see [1] or [2] for a comprehensive list). The two types of finite fields most commonly used in cryptographic standards are binary finite fields of type $\mathbb{F}_{2^m}$ and prime fields of type $\mathbb{F}_p$, where $p$ is a prime (cf. [3]). Efforts to fit finite field arithmetic efficiently into commercial processors have resulted in applications of medium characteristic finite fields, like those reported in [4] and [5]. Medium characteristic finite fields are fields of type $\mathbb{F}_{p^m}$, where $p$ is a prime that fits into a single processor word (slightly smaller than $2^w$ for word size $w$) and has a special form that simplifies modular reduction. Mersenne primes are an example of primes used in this context. The security parameter is given by the length of the binary representation of the field elements, and the extension degree $m$ is selected accordingly. For security reasons, the extension degree for fields of characteristic 2 or medium characteristic is usually chosen to be prime.

With the introduction of the method of Duursma and Lee for the computation of the Tate pairing (see [6]), fields of type $\mathbb{F}_{3^m}$ with $m$ prime have attracted special attention. Computing the Tate pairing on elliptic curves defined over $\mathbb{F}_{3^m}$ requires computations both in $\mathbb{F}_{3^m}$ and in $\mathbb{F}_{3^{6m}}$. In [7], these calculations are implemented using the tower of extensions

$$\mathbb{F}_{3^m} \subset \mathbb{F}_{3^{2m}} \subset \mathbb{F}_{3^{6m}}.$$

A multiplication in $\mathbb{F}_{3^{2m}}$ is done using 3 multiplications in $\mathbb{F}_{3^m}$, and a multiplication in $\mathbb{F}_{3^{6m}}$ using 6 multiplications in $\mathbb{F}_{3^{2m}}$. This requires a total of $6 \cdot 3 = 18$ multiplications in $\mathbb{F}_{3^m}$. In this paper we make use of the same extension tower, using 3 multiplications in $\mathbb{F}_{3^m}$ to multiply elements in $\mathbb{F}_{3^{2m}}$. Since we represent the elements of $\mathbb{F}_{3^{6m}}$ as polynomials with coefficients in $\mathbb{F}_{3^{2m}}$, we can use Lagrange interpolation to perform the multiplication. This requires only 5 multiplications in $\mathbb{F}_{3^{2m}}$, thus reducing the total number of $\mathbb{F}_{3^m}$-multiplications from 18 to $5 \cdot 3 = 15$. The method that we propose requires slightly more additions than the Karatsuba method. Notice, however, that for $m > 90$ (the range used in cryptographic applications) a multiplication in $\mathbb{F}_{3^m}$ requires many more resources than an addition; therefore the overall resource consumption is reduced, as also shown by the results of our software experiments in Section 4.

In comparison to the classical multiplication method, the Karatsuba method (see [8], [9], and [7]) reduces the number of multiplications while introducing extra additions. Since the cost of addition grows linearly in the length of the polynomials, whereas the cost of classical multiplication grows quadratically, multiplication becomes more expensive than addition as the degree of the field extension grows. Hence the above tradeoff makes sense. The cost of addition is considered so negligible compared to that of multiplication that the theory of multiplicative complexity of bilinear maps, especially polynomial multiplication, takes into account only the number of variable multiplications (see e.g. [10] and [11]). Of course, this theoretical model is of practical interest only when the number of additions and the cost of scalar multiplications can be kept small. A famous result in the theory of multiplicative complexity establishes a lower bound of $2n + 1$ for the number of variable multiplications needed to compute the product of two polynomials of degree at most $n$. This lower bound can be achieved only when the field contains enough elements (see [12] or [13]). The proof of the theorem uses Lagrange evaluation-interpolation, which is also at the core of our approach. This is similar to the short polynomial multiplication (convolution) methods for complex or real numbers in [14]. For this method to be especially efficient, the points at which evaluation and interpolation are done are chosen to be the powers of a primitive $(2n + 1)$st root of unity. In a field of type $\mathbb{F}_{3^{2m}}$, primitive fifth roots of unity do not exist for odd $m$. We overcome this problem by using fourth roots of unity instead. Notice that a primitive fourth root of unity always exists in a field of type $\mathbb{F}_{3^{2m}}$, since 4 divides $3^{2m} - 1$. We use an extra point to compute the fifth coefficient of the product. An advantage of using a primitive fourth root of unity is that the corresponding interpolation matrix is a $4 \times 4$ DFT matrix, so the evaluations and interpolations can be computed using radix-2 FFT techniques (see [15] or [16]), which saves further additions and scalar multiplications. The current work can be considered a continuation of [17], which combines linear-time multiplication methods with the classical or Karatsuba ones to achieve efficient polynomial multiplication formulas.

Our work is organized as follows. Section 2 explains how evaluation-interpolation can be used in general to produce short polynomial multiplication methods. In Section 3 we show how to apply this method to our special case, and produce explicit formulas for multiplication of polynomials of degree at most 2 over $\mathbb{F}_{3^{2m}}$. In Section 4 we fine-tune our method using FFT techniques, and give timing results of software implementations as well as explicit multiplication formulas. Section 5 shows how our results can be used in conjunction with the method of Duursma-Lee for computing the Tate pairing on some elliptic and hyperelliptic curves. Section 6 contains some final remarks and conclusions.



2 Multiplication using evaluation and interpolation 



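We now explain Lagrange evaluation-interpolation for polynomials with coefficients in $\mathbb{F}_{p^m}$. Throughout this section $m$ is not assumed to be prime (in the next section we will replace $m$ by $2m$). Let

$$a(z) = a_0 + a_1 z + \cdots + a_n z^n \in \mathbb{F}_{p^m}[z],$$

$$b(z) = b_0 + b_1 z + \cdots + b_n z^n \in \mathbb{F}_{p^m}[z]$$

be given such that

$$p^m > 2n. \tag{1}$$

We represent the product of the two polynomials by

$$c(z) = a(z)b(z) = c_0 + c_1 z + \cdots + c_{2n} z^{2n},$$

and let $e = (e_0, \ldots, e_{2n}) \in \mathbb{F}_{p^m}^{2n+1}$ be a vector with $2n + 1$ distinct entries; condition (1) guarantees that such a vector exists. Evaluation at these points is given by the map $\phi_e$,

$$\phi_e(f) = (f(e_0), \ldots, f(e_{2n})).$$

Let $A, B, C \in \mathbb{F}_{p^m}^{2n+1}$ denote the vectors $(a_0, \ldots, a_n, 0, \ldots, 0)$, $(b_0, \ldots, b_n, 0, \ldots, 0)$, and $(c_0, \ldots, c_{2n})$, respectively. Using the above notation we have

$$\phi_e(a) = V_e A^T, \qquad \phi_e(b) = V_e B^T, \qquad \text{and} \qquad \phi_e(c) = V_e C^T,$$

where $V_e$ is the Vandermonde matrix

$$V_e = \begin{pmatrix} 1 & e_0 & \cdots & e_0^{2n} \\ 1 & e_1 & \cdots & e_1^{2n} \\ \vdots & \vdots & & \vdots \\ 1 & e_{2n} & \cdots & e_{2n}^{2n} \end{pmatrix}.$$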
The $2n + 1$ coefficients of the product $c(z) = a(z) \cdot b(z)$ can be computed by interpolation applied to the evaluations of $c(z)$ at the chosen $2n + 1$ distinct points of $\mathbb{F}_{p^m}$. These evaluations can be computed by multiplying the evaluations of $a(z)$ and $b(z)$ at these points. Formally,

$$\phi_e(c) = \phi_e(a) * \phi_e(b),$$

where $*$ denotes componentwise multiplication of vectors. Equivalently, if we let $W_e$ be the inverse of the matrix $V_e$ (which is invertible because the $e_i$ are distinct), we have

$$C^T = W_e\bigl(\phi_e(a) * \phi_e(b)\bigr),$$

which allows us to compute the vector $C$, whose entries are the coefficients of the polynomial $c(z)$.
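The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the method of Section 3: it assumes a prime field $\mathbb{F}_q$ (i.e. $m = 1$), uses the hypothetical evaluation points $e_i = i$ for $0 \le i \le 2n$, and inverts $V_e$ by Gauss-Jordan elimination; all function names are ours. Only the componentwise product in `eval_interp_multiply` uses variable multiplications, $2n + 1$ of them.

```python
def vandermonde(points, q):
    """V_e: row i is (1, e_i, e_i^2, ..., e_i^(k-1)) mod q."""
    k = len(points)
    return [[pow(e, j, q) for j in range(k)] for e in points]


def inverse_mod(matrix, q):
    """Gauss-Jordan inversion of an invertible square matrix over F_q (q prime)."""
    k = len(matrix)
    aug = [row[:] + [int(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] % q)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], q - 2, q)  # Fermat inverse of the pivot
        aug[col] = [(x * inv) % q for x in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] % q:
                f = aug[r][col]
                aug[r] = [(x - f * y) % q for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def mat_vec(m, v, q):
    return [sum(x * y for x, y in zip(row, v)) % q for row in m]


def eval_interp_multiply(a, b, q):
    """Product of two polynomials of degree <= n (coefficients lowest degree first)."""
    n = len(a) - 1
    assert len(b) == n + 1 and q > 2 * n, "condition (1): q > 2n"
    V = vandermonde(list(range(2 * n + 1)), q)
    W = inverse_mod(V, q)                  # W_e = V_e^{-1}
    pad = [0] * n
    phi_a = mat_vec(V, a + pad, q)         # phi_e(a) = V_e A^T
    phi_b = mat_vec(V, b + pad, q)         # phi_e(b) = V_e B^T
    phi_c = [(x * y) % q for x, y in zip(phi_a, phi_b)]  # 2n + 1 variable mults
    return mat_vec(W, phi_c, q)            # C^T = W_e (phi_e(a) * phi_e(b))


def schoolbook(a, b, q):
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] = (c[i + j] + x * y) % q
    return c


print(eval_interp_multiply([3, 1, 4], [1, 5, 9], 101))
```

With $q = 101$ the call prints `[3, 16, 36, 29, 36]`, the same coefficients that `schoolbook` returns.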

When condition (1) is satisfied, the polynomial multiplication methods constructed in this way have the smallest multiplicative complexity, i.e. the number of variable multiplications in $\mathbb{F}_{p^m}$ achieves the lower bound $2n + 1$ (see [12]). In fact, (1) can be relaxed to also allow $p^m = 2n$. In this case a virtual element $\infty$ is added to the finite field as an extra evaluation point. This corresponds to the fact that the leading coefficient of the product is the product of the leading coefficients of the factors.
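As a small illustration (our own example, not taken from the cited works), take $p = 2$, $m = 1$ and $n = 1$, so that $p^m = 2n$ and $\mathbb{F}_2$ supplies only the two points $0$ and $1$. Adding $\infty$, with $a(\infty) = a_1$, $b(\infty) = b_1$ and $c(\infty) = c_2$, gives three evaluation points and recovers the Karatsuba formulas:

$$
\begin{aligned}
c_0 &= a(0)\,b(0) = a_0 b_0, \\
c_2 &= a(\infty)\,b(\infty) = a_1 b_1, \\
c_1 &= a(1)\,b(1) - c_0 - c_2 = (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 .
\end{aligned}
$$

Only the three variable multiplications $a_0 b_0$, $a_1 b_1$ and $(a_0 + a_1)(b_0 + b_1)$ appear, matching the bound $2n + 1 = 3$.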

Application of this method to practical situations is not straightforward, since the number of additions increases and eventually outweighs the savings in the number of multiplications. For this method to be efficient, $n$ must be much smaller than $p^m$. An instance of this occurs when computing in extensions of medium-size primes (see e.g. [13]). The case of small values of $p$ is more complicated, even for small values of $n$. Recall that in this case the entries of the matrix $V_e$ lie in $\mathbb{F}_{p^m}$ and are generally represented, with respect to a basis $B$ of $\mathbb{F}_{p^m}$ over $\mathbb{F}_p$, as polynomials of degree at most $m - 1$ over $\mathbb{F}_p$. For multiplication of $V_e$ by vectors to be efficient, the entries of this matrix must be chosen to be sparse. However, this gives no control over the sparsity of the entries of $W_e$. Indeed, one requirement for the entries of $W_e$ to be sparse in the basis $B$ is that the inverse of the determinant of $V_e$, namely

$$\Bigl(\prod_{0 \le i < j \le 2n} (e_j - e_i)\Bigr)^{-1},$$

has a sparse representation in $B$. We are not aware of any method that can be used here. On the other hand, it is known that if the $e_i$'s are the elements of the geometric progression $\omega^i$, $0 \le i \le 2n$, where $\omega$ is a primitive $(2n + 1)$st root of unity, then the inverse $W_e$ equals $1/(2n + 1)$ times the Vandermonde matrix whose $e_i$'s are the elements of the geometric progression $\omega^{-i}$ (see [2]). We denote these two matrices by $V_\omega$ and $V_{\omega^{-1}}$, respectively. This fact suggests that choosing powers of roots of unity as interpolation points should enable us to control the sparsity of the entries of the corresponding Vandermonde matrices. Roots of unity are used in different contexts for multiplication of polynomials, e.g. in the FFT (see [2]) or for the construction of short multiplication methods in [14]. In the next section we discuss how to use fourth roots of unity to compute multiplication in $\mathbb{F}_{3^{6m}}$ using only 5 multiplications in $\mathbb{F}_{3^{2m}}$.
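The inverse relation between $V_\omega$ and $V_{\omega^{-1}}$ can be checked numerically. The following Python sketch uses a prime field as a stand-in for $\mathbb{F}_{p^m}$; the parameters $q = 11$, $n = 2$ and $\omega = 3$ (an element of multiplicative order 5 modulo 11) are our illustrative choices:

```python
# Illustrative parameters (ours): q = 11, n = 2, so 2n + 1 = 5 divides q - 1,
# and w = 3 has multiplicative order 5 modulo 11.
q, n = 11, 2
k = 2 * n + 1
w = 3
w_inv = pow(w, q - 2, q)  # inverse of w modulo the prime q (Fermat)

# w must be a *primitive* k-th root of unity.
assert pow(w, k, q) == 1 and all(pow(w, d, q) != 1 for d in range(1, k))


def vandermonde_of_powers(root):
    """Vandermonde matrix with nodes e_i = root^i: entry (i, j) is root^(i*j)."""
    return [[pow(root, i * j, q) for j in range(k)] for i in range(k)]


V = vandermonde_of_powers(w)               # V_w
V_inv_root = vandermonde_of_powers(w_inv)  # V_{w^{-1}}
product = [[sum(V[i][t] * V_inv_root[t][j] for t in range(k)) % q for j in range(k)]
           for i in range(k)]
print(product)  # (2n + 1) times the identity, i.e. 5 on the diagonal

# Hence W_e = V_w^{-1} = (2n + 1)^{-1} * V_{w^{-1}}.
scale = pow(k, q - 2, q)
W = [[(scale * x) % q for x in row] for row in V_inv_root]
```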






3 Multiplication using roots of unity 



Elements of $\mathbb{F}_{3^{6m}}$ can be represented as polynomials of degree at most 2 over $\mathbb{F}_{3^{2m}}$. Therefore, the product of two such elements is first obtained as a polynomial of degree at most 4 with coefficients in $\mathbb{F}_{3^{2m}}$. To apply the classical evaluation-interpolation method at the powers of a root of unity we would need a primitive fifth root of unity. This would require $3^{2m} - 1$ to be a multiple of 5, which is never the case unless $m$ is even, since $3^{2m} = 9^m \equiv (-1)^m \pmod 5$ (recall that cryptographic applications require $m$ to be prime). However, using the relation



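$$c_4 = a_2 b_2 \tag{2}$$

we can compute the coefficients of $c(z)$ via

$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega & \omega^2 & \omega^3 \\ 1 & \omega^2 & 1 & \omega^2 \\ 1 & \omega^3 & \omega^2 & \omega \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} = \begin{pmatrix} a(1)b(1) - c_4 \\ a(\omega)b(\omega) - c_4 \\ a(\omega^2)b(\omega^2) - c_4 \\ a(\omega^3)b(\omega^3) - c_4 \end{pmatrix}, \tag{3}$$

where $\omega$ is a primitive fourth root of unity. Equation (3) holds because $\omega^{4i} = 1$, so that $c(\omega^i) = c_0 + c_1 \omega^i + c_2 \omega^{2i} + c_3 \omega^{3i} + c_4$. Now we apply (2) and (3) to find explicit formulas for multiplying two polynomials of degree at most 2 over $\mathbb{F}_{3^{2m}}$, where $m > 2$ is a prime.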
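The divisibility facts behind this choice (primitive fifth roots of unity exist in $\mathbb{F}_{3^{2m}}$ only for even $m$, primitive fourth roots always exist there, and for odd $m$ the field $\mathbb{F}_{3^m}$ has none, which is what makes $y^2 + 1$ irreducible in the tower used below) can be spot-checked numerically. A small Python sketch, relying only on the fact that the multiplicative group of $\mathbb{F}_q$ is cyclic of order $q - 1$; the helper name is ours:

```python
def has_primitive_root_of_unity(order, field_size):
    """F_q contains a primitive `order`-th root of unity iff order divides q - 1
    (the multiplicative group of F_q is cyclic of order q - 1)."""
    return (field_size - 1) % order == 0


for m in range(1, 60):
    q2 = 3 ** (2 * m)  # size of F_{3^{2m}}
    # Primitive fifth roots of unity exist in F_{3^{2m}} exactly when m is even.
    assert has_primitive_root_of_unity(5, q2) == (m % 2 == 0)
    # Primitive fourth roots of unity always exist in F_{3^{2m}}.
    assert has_primitive_root_of_unity(4, q2)
    if m % 2 == 1:
        # For odd m, F_{3^m} has no element of order 4, so y^2 + 1 has no root there.
        assert not has_primitive_root_of_unity(4, 3 ** m)
print("checks passed for m = 1..59")
```

This is only a numerical sanity check; the general statements follow from $9 \equiv -1 \pmod 5$, $9 \equiv 1 \pmod 8$ and $3^m \equiv 3 \pmod 4$ for odd $m$.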

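We follow the tower representation of [7], i.e.

$$\mathbb{F}_{3^m} \cong \mathbb{F}_3[x]/(f(x)), \qquad \mathbb{F}_{3^{2m}} \cong \mathbb{F}_{3^m}[y]/(y^2 + 1), \tag{4}$$

where $f(x) \in \mathbb{F}_3[x]$ is an irreducible polynomial of degree $m$. Denote by $s$ the equivalence class of $y$, so that $s^2 = -1$. Note that for odd $m > 2$ we have $3^m \equiv 3 \pmod 4$, so $4 \nmid 3^m - 1$; hence $y^2 + 1$ is irreducible over $\mathbb{F}_{3^m}$, since the roots of $y^2 + 1$ are primitive fourth roots of unity. Let

$$a(z) = a_0 + a_1 z + a_2 z^2, \qquad b(z) = b_0 + b_1 z + b_2 z^2 \tag{5}$$

be polynomials in $\mathbb{F}_{3^{2m}}[z]$. Our goal is to compute the coefficients of the polynomial

$$c(z) = a(z)b(z) = c_0 + c_1 z + \cdots + c_4 z^4.$$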

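Evaluation of $a(z)$ and $b(z)$ at $(1, s, s^2, s^3) = (1, s, -1, -s)$ can be done by multiplying the Vandermonde matrix of powers of $s$

$$V_s = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & s & -1 & -s \\ 1 & -1 & 1 & -1 \\ 1 & -s & -1 & s \end{pmatrix} \tag{6}$$

by the vectors $(a_0, a_1, a_2, 0)^T$ and $(b_0, b_1, b_2, 0)^T$, respectively. This yields the vectors

$$\phi_e(a) = \begin{pmatrix} a_0 + a_1 + a_2 \\ a_0 + s a_1 - a_2 \\ a_0 - a_1 + a_2 \\ a_0 - s a_1 - a_2 \end{pmatrix} \qquad \text{and} \qquad \phi_e(b) = \begin{pmatrix} b_0 + b_1 + b_2 \\ b_0 + s b_1 - b_2 \\ b_0 - b_1 + b_2 \\ b_0 - s b_1 - b_2 \end{pmatrix}.$$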
Let $\phi_e(c) = \phi_e(a) * \phi_e(b)$ be the componentwise product of $\phi_e(a)$ and $\phi_e(b)$:

$$\begin{pmatrix} P_0 \\ P_1 \\ P_2 \\ P_3 \end{pmatrix} = \begin{pmatrix} (a_0 + a_1 + a_2)(b_0 + b_1 + b_2) \\ (a_0 + s a_1 - a_2)(b_0 + s b_1 - b_2) \\ (a_0 - a_1 + a_2)(b_0 - b_1 + b_2) \\ (a_0 - s a_1 - a_2)(b_0 - s b_1 - b_2) \end{pmatrix}.$$



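Using (2) and (3), together with the fact that the inverse of $V_s$ is $\frac{1}{4}$ times the Vandermonde matrix of powers of $s^{-1} = -s$, where $\frac{1}{4} = 1$ in characteristic 3, we get

$$\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} = W_s \begin{pmatrix} P_0 - P_4 \\ P_1 - P_4 \\ P_2 - P_4 \\ P_3 - P_4 \end{pmatrix}, \tag{7}$$

where $P_4 = a_2 b_2$ and

$$W_s = V_s^{-1} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -s & -1 & s \\ 1 & -1 & 1 & -1 \\ 1 & s & -1 & -s \end{pmatrix}.$$

Thus the explicit formulas for the coefficients of the product are

$$\begin{aligned} c_0 &= P_0 + P_1 + P_2 + P_3 - P_4, \\ c_1 &= P_0 - s P_1 - P_2 + s P_3, \\ c_2 &= P_0 - P_1 + P_2 - P_3, \\ c_3 &= P_0 + s P_1 - P_2 - s P_3, \\ c_4 &= P_4. \end{aligned} \tag{8}$$

Here the coefficient of $P_4$ in $c_0$ is $-4 = -1$, while in $c_1$, $c_2$ and $c_3$ the contributions of $P_4$ cancel, because the corresponding rows of $W_s$ sum to zero.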
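Formulas (8) can be verified mechanically against schoolbook multiplication. The Python sketch below is a toy check under our own assumptions: it takes $m = 1$, so that $\mathbb{F}_{3^{2m}}$ becomes $\mathbb{F}_9 = \mathbb{F}_3[s]/(s^2 + 1)$, stores $v + us$ as the pair `(v, u)`, and uses hypothetical helper names. It performs exactly five general multiplications (`mul`), while multiplication by $s$ is only a coefficient swap with a sign change:

```python
import itertools
import random


# Toy model (ours): m = 1, so F_{3^{2m}} becomes F_9 = F_3[s]/(s^2 + 1);
# an element v + u*s is stored as the pair (v, u) with v, u in {0, 1, 2}.
def add(x, y): return ((x[0] + y[0]) % 3, (x[1] + y[1]) % 3)
def sub(x, y): return ((x[0] - y[0]) % 3, (x[1] - y[1]) % 3)
def mul_s(x): return ((-x[1]) % 3, x[0])  # (v + u s) s = -u + v s


def mul(x, y):  # general product, using s^2 = -1
    return ((x[0] * y[0] - x[1] * y[1]) % 3, (x[0] * y[1] + x[1] * y[0]) % 3)


def mult_deg2_five(a, b):
    """(a0 + a1 z + a2 z^2)(b0 + b1 z + b2 z^2) with five calls to mul, following (8)."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    # Evaluations at 1, s, -1, -s multiplied pointwise give P0..P3; P4 = a2*b2.
    P0 = mul(add(add(a0, a1), a2), add(add(b0, b1), b2))
    P1 = mul(sub(add(a0, mul_s(a1)), a2), sub(add(b0, mul_s(b1)), b2))
    P2 = mul(add(sub(a0, a1), a2), add(sub(b0, b1), b2))
    P3 = mul(sub(sub(a0, mul_s(a1)), a2), sub(sub(b0, mul_s(b1)), b2))
    P4 = mul(a2, b2)
    # Interpolation with W_s (the factor 1/4 equals 1 in characteristic 3).
    c0 = sub(add(add(add(P0, P1), P2), P3), P4)
    c1 = add(sub(sub(P0, mul_s(P1)), P2), mul_s(P3))
    c2 = sub(add(sub(P0, P1), P2), P3)
    c3 = sub(sub(add(P0, mul_s(P1)), P2), mul_s(P3))
    return [c0, c1, c2, c3, P4]


def mult_schoolbook(a, b):
    c = [(0, 0)] * 5
    for i, j in itertools.product(range(3), repeat=2):
        c[i + j] = add(c[i + j], mul(a[i], b[j]))
    return c


field = [(v, u) for v in range(3) for u in range(3)]
rng = random.Random(1)
for _ in range(1000):
    a = [rng.choice(field) for _ in range(3)]
    b = [rng.choice(field) for _ in range(3)]
    assert mult_deg2_five(a, b) == mult_schoolbook(a, b)
print("formulas (8) agree with schoolbook multiplication")
```

Since (8) only uses $s^2 = -1$ and $4 = 1$ in characteristic 3, the same structure carries over to larger $m$ once `add`, `sub` and `mul` act on pairs of elements of $\mathbb{F}_{3^m}$.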



4 Efficient implementation 



We owe the efficiency of our method to the Cooley-Tukey factorization of the DFT matrix ([15]). The matrices $V_s$ and $W_s$ in (6) and (7) are not sparse, but they are the DFT matrices of the fourth roots of unity $s$ and $s^3 = -s$, respectively. Hence each of them can be factored as a product of two sparse matrices, as shown in (9) and (10).



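$$V_s = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & s & -1 & -s \\ 1 & -1 & 1 & -1 \\ 1 & -s & -1 & s \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & s \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -s \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix} \tag{9}$$

$$W_s = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -s & -1 & s \\ 1 & -1 & 1 & -1 \\ 1 & s & -1 & -s \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -s \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & s \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix} \tag{10}$$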



The factorizations in (9) and (10) allow us to compute the products of $V_s$ and $W_s$ with vectors efficiently: each such product takes eight additions or subtractions in $\mathbb{F}_{3^{2m}}$ and a single multiplication by $\pm s$. Notice also that the product of an element $w = us + v \in \mathbb{F}_{3^m}[s] = \mathbb{F}_{3^{2m}}$ (with $u, v \in \mathbb{F}_{3^m}$) with $s$ equals $vs - u$. Hence multiplying an element of $\mathbb{F}_{3^{2m}}$ by $s$ is not more expensive than a change of sign.
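The radix-2 structure of (9) and (10) can be sketched as follows. To stay self-contained the example works in the toy field $\mathbb{F}_9 = \mathbb{F}_3[s]/(s^2 + 1)$, i.e. $m = 1$ (our assumption; for real parameters the pairs would hold elements of $\mathbb{F}_{3^m}$), and the names `fft4` and `dense` are hypothetical. It compares the factored evaluation and interpolation with the dense matrices $V_s$ and $W_s$:

```python
import random


# Toy model (ours): F_9 = F_3[s]/(s^2 + 1) stands in for F_{3^{2m}};
# an element v + u*s is stored as the pair (v, u) with v, u in {0, 1, 2}.
def add(x, y): return ((x[0] + y[0]) % 3, (x[1] + y[1]) % 3)
def sub(x, y): return ((x[0] - y[0]) % 3, (x[1] - y[1]) % 3)
def neg(x): return ((-x[0]) % 3, (-x[1]) % 3)
def mul_s(x): return ((-x[1]) % 3, x[0])  # (v + u s) s = -u + v s: swap and sign change
def mul(x, y): return ((x[0] * y[0] - x[1] * y[1]) % 3, (x[0] * y[1] + x[1] * y[0]) % 3)


ZERO, ONE, S = (0, 0), (1, 0), (0, 1)


def fft4(x, inverse=False):
    """Multiply x by V_s (or by W_s if inverse=True) via the factorizations (9)/(10):
    8 additions/subtractions plus a single multiplication by s or -s."""
    x0, x1, x2, x3 = x
    # Right factor, shared by (9) and (10): butterflies on the index pairs (0, 2), (1, 3).
    t0, t1, t2, t3 = add(x0, x2), add(x1, x3), sub(x0, x2), sub(x1, x3)
    # Left factor: the twiddle factor is s for V_s and -s for W_s.
    tw = neg(mul_s(t3)) if inverse else mul_s(t3)
    return [add(t0, t1), add(t2, tw), sub(t0, t1), sub(t2, tw)]


def dense(matrix, x):
    """Plain matrix-vector product, for comparison."""
    out = []
    for row in matrix:
        acc = ZERO
        for entry, xj in zip(row, x):
            acc = add(acc, mul(entry, xj))
        out.append(acc)
    return out


MONE, MS = neg(ONE), neg(S)
V_s = [[ONE, ONE, ONE, ONE], [ONE, S, MONE, MS], [ONE, MONE, ONE, MONE], [ONE, MS, MONE, S]]
W_s = [[ONE, ONE, ONE, ONE], [ONE, MS, MONE, S], [ONE, MONE, ONE, MONE], [ONE, S, MONE, MS]]

field = [(v, u) for v in range(3) for u in range(3)]
rng = random.Random(7)
for _ in range(200):
    x = [rng.choice(field) for _ in range(4)]
    assert fft4(x) == dense(V_s, x)
    assert fft4(x, inverse=True) == dense(W_s, x)
    assert fft4(fft4(x), inverse=True) == x  # W_s V_s = 4I = I in characteristic 3
print("factored evaluation/interpolation agrees with V_s and W_s")
```

The twiddle `tw` is computed once and reused in two outputs, which is where the single multiplication by $\pm s$ comes from.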

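Notice that, as an alternative to the Vandermonde matrix corresponding to $s$, we could use the matrix of evaluation at the points $0, 1, -1, s$,

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & s & -1 & -s \end{pmatrix},$$

whose inverse is

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ s & 1 - s & -1 - s & s \\ -1 & -1 & -1 & 0 \\ -s & 1 + s & -1 + s & -s \end{pmatrix}.$$

These matrices are sparser than $V_s$ and $W_s$, but since, to our knowledge, they do not possess any special structure, multiplying them by vectors is more expensive than multiplying $V_s$ and $W_s$ by vectors.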
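A quick machine check of the stated inverse, again in the toy field $\mathbb{F}_9 = \mathbb{F}_3[s]/(s^2 + 1)$ as a stand-in for $\mathbb{F}_{3^{2m}}$ (the representation and the helper names are ours):

```python
# Toy model (ours): elements v + u*s of F_9 = F_3[s]/(s^2 + 1) as pairs (v, u).
def add(x, y): return ((x[0] + y[0]) % 3, (x[1] + y[1]) % 3)
def mul(x, y): return ((x[0] * y[0] - x[1] * y[1]) % 3, (x[0] * y[1] + x[1] * y[0]) % 3)


def el(v, u=0):
    """The element v + u*s, with integers v, u reduced mod 3."""
    return (v % 3, u % 3)


s = el(0, 1)
M = [[el(1), el(0), el(0), el(0)],
     [el(1), el(1), el(1), el(1)],
     [el(1), el(-1), el(1), el(-1)],
     [el(1), s, el(-1), el(0, -1)]]
N = [[el(1), el(0), el(0), el(0)],
     [el(0, 1), el(1, -1), el(-1, -1), el(0, 1)],
     [el(-1), el(-1), el(-1), el(0)],
     [el(0, -1), el(1, 1), el(-1, 1), el(0, -1)]]


def matmul(X, Y):
    """4x4 matrix product over F_9."""
    result = []
    for i in range(4):
        row = []
        for j in range(4):
            acc = el(0)
            for t in range(4):
                acc = add(acc, mul(X[i][t], Y[t][j]))
            row.append(acc)
        result.append(row)
    return result


identity = [[el(int(i == j)) for j in range(4)] for i in range(4)]
assert matmul(M, N) == identity and matmul(N, M) == identity
print("the stated inverse is correct")
```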

Multiplying elements in the field $\mathbb{F}_{3^{6 \cdot 97}}$ is required in the Tate pairing computation on the group of $\mathbb{F}_{3^{97}}$-rational points of the elliptic curves

$$E_d : y^2 = x^3 - x + d, \qquad d \in \{-1, 1\},$$

defined over $\mathbb{F}_3$. An efficient algorithm for the computation of the Tate pairing on these curves is discussed in [6].

We have implemented multiplication in $\mathbb{F}_{3^{6 \cdot 97}}$ using the Karatsuba method, the Montgomery method from [18], and our proposed method on a PC with an AMD Athlon 64 3500+ processor running at 2.20 GHz, using the NTL library (see [19]) for multiplication in $\mathbb{F}_{3^{97}}$. Note that although we have chosen $m = 97$ for benchmarking purposes, these methods can be applied to any odd $m > 2$, as mentioned in Section 3.



Multiplication method    Elapsed time (ms)
Karatsuba method         1.698
Montgomery method        1.605
Proposed method          1.451

Table 1. Comparison of the execution times of the Karatsuba and Montgomery multipliers with the proposed method for $\mathbb{F}_{3^{6m}}$, $m = 97$.



The execution times are shown in Table 1. For the Karatsuba and the proposed methods we have used the tower of extensions

$$\mathbb{F}_{3^{97}} \subset \mathbb{F}_{3^{2 \cdot 97}} \subset \mathbb{F}_{3^{6 \cdot 97}},$$

where

$$\begin{aligned} \mathbb{F}_{3^{97}} &\cong \mathbb{F}_3[x]/(x^{97} + x^{16} + 2), \\ \mathbb{F}_{3^{2 \cdot 97}} &\cong \mathbb{F}_{3^{97}}[y]/(y^2 + 1), \\ \mathbb{F}_{3^{6 \cdot 97}} &\cong \mathbb{F}_{3^{2 \cdot 97}}[z]/(z^3 - z - 1), \end{aligned}$$

whereas for the Montgomery method the representation

$$\mathbb{F}_{3^{6 \cdot 97}} \cong \mathbb{F}_{3^{97}}[y]/(y^6 + y - 1)$$

has been used. Our implementations show that the new method takes about 14.5% less time than the Karatsuba method and about 9.6% less time than the Montgomery method, which roughly reflects the proportion of saved multiplications. This provides further evidence that the number of multiplications in $\mathbb{F}_{3^{97}}$ is a good indicator of the performance of a multiplication method for $\mathbb{F}_{3^{6 \cdot 97}}$.

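Our multiplications are based on the following formulas. Let $\alpha, \beta \in \mathbb{F}_{3^{6m}}$ be given as

$$\begin{aligned} \alpha &= a_0 + a_1 s + a_2 r + a_3 rs + a_4 r^2 + a_5 r^2 s, \\ \beta &= b_0 + b_1 s + b_2 r + b_3 rs + b_4 r^2 + b_5 r^2 s, \end{aligned}$$

where $a_0, \ldots, a_5, b_0, \ldots, b_5 \in \mathbb{F}_{3^m}$, and $s \in \mathbb{F}_{3^{2m}}$, $r \in \mathbb{F}_{3^{6m}}$ are roots of $y^2 + 1$ and $z^3 - z - 1$, respectively. Let their product $\gamma = \alpha\beta \in \mathbb{F}_{3^{6m}}$ be

$$\gamma = c_0 + c_1 s + c_2 r + c_3 rs + c_4 r^2 + c_5 r^2 s.$$

The coefficients $c_i$ for $0 \le i \le 5$ are computed using: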





$$\begin{aligned}
P_0 &= (a_0 + a_2 + a_4)(b_0 + b_2 + b_4) \\
P_1 &= (a_0 + a_1 + a_2 + a_3 + a_4 + a_5)(b_0 + b_1 + b_2 + b_3 + b_4 + b_5) \\
P_2 &= (a_1 + a_3 + a_5)(b_1 + b_3 + b_5) \\
P_3 &= (a_0 - a_3 - a_4)(b_0 - b_3 - b_4) \\
P_4 &= (a_0 + a_1 + a_2 - a_3 - a_4 - a_5)(b_0 + b_1 + b_2 - b_3 - b_4 - b_5) \\
P_5 &= (a_1 + a_2 - a_5)(b_1 + b_2 - b_5) \\
P_6 &= (a_0 - a_2 + a_4)(b_0 - b_2 + b_4) \\
P_7 &= (a_0 + a_1 - a_2 - a_3 + a_4 + a_5)(b_0 + b_1 - b_2 - b_3 + b_4 + b_5) \\
P_8 &= (a_1 - a_3 + a_5)(b_1 - b_3 + b_5) \\
P_9 &= (a_0 + a_3 - a_4)(b_0 + b_3 - b_4) \\
P_{10} &= (a_0 + a_1 - a_2 + a_3 - a_4 - a_5)(b_0 + b_1 - b_2 + b_3 - b_4 - b_5) \\
P_{11} &= (a_1 - a_2 - a_5)(b_1 - b_2 - b_5) \\
P_{12} &= a_4 b_4 \\
P_{13} &= (a_4 + a_5)(b_4 + b_5) \\
P_{14} &= a_5 b_5
\end{aligned}$$

$$\begin{aligned}
c_0 &= -P_0 + P_2 - P_3 - P_4 + P_{10} + P_{11} - P_{12} + P_{14} \\
c_1 &= P_0 - P_1 + P_2 + P_4 + P_5 + P_9 + P_{10} + P_{12} - P_{13} + P_{14} \\
c_2 &= -P_0 + P_2 + P_6 - P_8 + P_{12} - P_{14} \\
c_3 &= P_0 - P_1 + P_2 - P_6 + P_7 - P_8 - P_{12} + P_{13} - P_{14} \\
c_4 &= P_0 - P_2 - P_3 + P_5 + P_6 - P_8 - P_9 + P_{11} + P_{12} - P_{14} \\
c_5 &= -P_0 + P_1 - P_2 + P_3 - P_4 + P_5 - P_6 + P_7 - P_8 + P_9 - P_{10} + P_{11} - P_{12} + P_{13} - P_{14}
\end{aligned}$$
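To double-check these formulas, one can implement them directly. The following Python sketch is our own illustration, not the authors' NTL-based implementation: it models $\mathbb{F}_{3^{97}}$ as $\mathbb{F}_3[x]/(x^{97} + x^{16} + 2)$ with naive schoolbook arithmetic, evaluates $P_0, \ldots, P_{14}$ with exactly 15 calls to the base-field multiplication `fmul`, and compares the result with a schoolbook product in $\mathbb{F}_{3^{97}}[s, r]/(s^2 + 1, r^3 - r - 1)$ that uses 36 base-field multiplications. The check does not depend on the irreducibility of the modulus, since the formulas are identities in any commutative ring of characteristic 3 with $s^2 = -1$ and $r^3 = r + 1$:

```python
import random

# Illustrative model (ours): F_{3^97} = F_3[x]/(x^97 + x^16 + 2). An element is a
# list of 97 coefficients in {0, 1, 2}, lowest degree first.
M = 97


def fadd(*xs):
    """Sum of several elements of F_{3^97}."""
    return [sum(cs) % 3 for cs in zip(*xs)]


def fneg(x):
    return [(-c) % 3 for c in x]


def fmul(x, y):
    """Schoolbook product in F_3[x], then reduction using x^97 = 2x^16 + 1."""
    prod = [0] * (2 * M - 1)
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                prod[i + j] += xi * yj
    for k in range(2 * M - 2, M - 1, -1):  # fold the high part down, top degree first
        c = prod[k] % 3
        if c:
            prod[k - 81] += 2 * c  # x^k = x^(k-97) * (2x^16 + 1)
            prod[k - 97] += c
    return [c % 3 for c in prod[:M]]


def mul_6m_fifteen(a, b):
    """c_0..c_5 of alpha*beta via P_0..P_14: exactly 15 calls to fmul."""
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    n = fneg
    P = [
        fmul(fadd(a0, a2, a4), fadd(b0, b2, b4)),
        fmul(fadd(a0, a1, a2, a3, a4, a5), fadd(b0, b1, b2, b3, b4, b5)),
        fmul(fadd(a1, a3, a5), fadd(b1, b3, b5)),
        fmul(fadd(a0, n(a3), n(a4)), fadd(b0, n(b3), n(b4))),
        fmul(fadd(a0, a1, a2, n(a3), n(a4), n(a5)), fadd(b0, b1, b2, n(b3), n(b4), n(b5))),
        fmul(fadd(a1, a2, n(a5)), fadd(b1, b2, n(b5))),
        fmul(fadd(a0, n(a2), a4), fadd(b0, n(b2), b4)),
        fmul(fadd(a0, a1, n(a2), n(a3), a4, a5), fadd(b0, b1, n(b2), n(b3), b4, b5)),
        fmul(fadd(a1, n(a3), a5), fadd(b1, n(b3), b5)),
        fmul(fadd(a0, a3, n(a4)), fadd(b0, b3, n(b4))),
        fmul(fadd(a0, a1, n(a2), a3, n(a4), n(a5)), fadd(b0, b1, n(b2), b3, n(b4), n(b5))),
        fmul(fadd(a1, n(a2), n(a5)), fadd(b1, n(b2), n(b5))),
        fmul(a4, b4),
        fmul(fadd(a4, a5), fadd(b4, b5)),
        fmul(a5, b5),
    ]
    # Row i lists the signs (+1, -1, 0) of P_0..P_14 in c_i, transcribed from the formulas.
    signs = [
        [-1, 0, 1, -1, -1, 0, 0, 0, 0, 0, 1, 1, -1, 0, 1],
        [1, -1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, -1, 1],
        [-1, 0, 1, 0, 0, 0, 1, 0, -1, 0, 0, 0, 1, 0, -1],
        [1, -1, 1, 0, 0, 0, -1, 1, -1, 0, 0, 0, -1, 1, -1],
        [1, 0, -1, -1, 0, 1, 1, 0, -1, -1, 0, 1, 1, 0, -1],
        [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
    ]
    return [[sum(e * p[k] for e, p in zip(row, P)) % 3 for k in range(M)] for row in signs]


def mul_6m_reference(a, b):
    """Schoolbook product in F_{3^97}[s, r]/(s^2 + 1, r^3 - r - 1): 36 calls to fmul.
    Coefficient a[2i + j] belongs to the monomial r^i s^j."""
    zero = [0] * M
    acc = [[zero] * 3 for _ in range(5)]  # acc[k][l]: coefficient of r^k s^l
    for i in range(3):
        for j in range(2):
            for k in range(3):
                for t in range(2):
                    term = fmul(a[2 * i + j], b[2 * k + t])
                    acc[i + k][j + t] = fadd(acc[i + k][j + t], term)
    re = [fadd(acc[k][0], fneg(acc[k][2])) for k in range(5)]  # s^2 = -1
    im = [acc[k][1] for k in range(5)]

    def reduce_r(v):  # r^3 = r + 1 and r^4 = r^2 + r
        return [fadd(v[0], v[3]), fadd(v[1], v[3], v[4]), fadd(v[2], v[4])]

    re, im = reduce_r(re), reduce_r(im)
    return [re[0], im[0], re[1], im[1], re[2], im[2]]


rng = random.Random(2007)
alpha = [[rng.randrange(3) for _ in range(M)] for _ in range(6)]
beta = [[rng.randrange(3) for _ in range(M)] for _ in range(6)]
assert mul_6m_fifteen(alpha, beta) == mul_6m_reference(alpha, beta)
print("the 15-multiplication formulas agree with the schoolbook product")
```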
5 Other applications of the proposed method

Consider the family of hyperelliptic curves

$$C_d : y^2 = x^p - x + d, \qquad d \in \{-1, 1\}, \tag{11}$$

defined over $\mathbb{F}_p$, for $p \equiv 3 \pmod 4$. Let $m$ be such that $\gcd(2p, m) = 1$ (in practice $m$ will often be prime), and consider the $\mathbb{F}_{p^m}$-rational points of the Jacobian of $C_d$. An efficient implementation of the Tate pairing on these groups is given by Duursma and Lee in [6] and [20], where they extend analogous results of Barreto et al. and of Galbraith et al. for the case $p = 3$. Notice that this family of curves includes the elliptic curves $E_d$ mentioned in the previous section. In the aforementioned papers it is also shown that the curve $C_d$ has embedding degree $2p$. To compute the Tate pairing on this curve, one works with the tower of field extensions

$$\mathbb{F}_{p^m} \subset \mathbb{F}_{p^{2m}} \subset \mathbb{F}_{p^{2pm}},$$

where the fields are represented as

$$\mathbb{F}_{p^{2m}} \cong \mathbb{F}_{p^m}[y]/(y^2 + 1) \qquad \text{and} \qquad \mathbb{F}_{p^{2pm}} \cong \mathbb{F}_{p^{2m}}[z]/(z^p - z + 2d).$$

Let $a(z), b(z) \in \mathbb{F}_{p^{2m}}[z]$ be polynomials of degree at most $p - 1$,

$$a(z) = a_0 + a_1 z + \cdots + a_{p-1} z^{p-1}, \qquad b(z) = b_0 + b_1 z + \cdots + b_{p-1} z^{p-1}.$$

Then $c(z) = a(z)b(z)$ has $2p - 1$ coefficients, two of which can be computed as

$$c_0 = a_0 b_0 \qquad \text{and} \qquad c_{2(p-1)} = a_{p-1} b_{p-1}.$$

To determine the remaining $2p - 3$ coefficients, we can write a Vandermonde matrix with entries in $\mathbb{F}_{p^{2m}}$ using, e.g., the $2p - 3$ elements

$$1, 2, \ldots, p - 1, \quad \pm s, \ldots, \pm \tfrac{p-3}{2} s, \quad \tfrac{p-1}{2} s.$$

Another option is to write a Vandermonde matrix using a primitive $2(p - 1)$st root of unity, combined with the relation

$$c_{2(p-1)} = a_{p-1} b_{p-1}.$$

Notice that there is an element of order $2(p - 1)$ in $\mathbb{F}_{p^2}$, since $2(p - 1) \mid p^2 - 1$. If $\alpha$ is a primitive element of $\mathbb{F}_{p^2}$, then $\omega = \alpha^{(p+1)/2}$ is a primitive $2(p - 1)$st root of unity.
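For a concrete case of the last two paragraphs, take $p = 7$ (so $p \equiv 3 \bmod 4$). The Python sketch below is our own illustration with naive arithmetic in $\mathbb{F}_{49} = \mathbb{F}_7[s]/(s^2 + 1)$: it finds a primitive element $\alpha$ by brute force, checks that $\omega = \alpha^{(p+1)/2}$ has order $2(p - 1) = 12$, and confirms that the $2p - 3 = 11$ interpolation points listed above are distinct and nonzero. The helper names are hypothetical:

```python
# Illustration (ours) for p = 7. Elements v + u*s of F_49 = F_7[s]/(s^2 + 1)
# are stored as pairs (v, u).
p = 7
assert p % 4 == 3  # -1 is a non-square mod p, so y^2 + 1 is irreducible over F_p


def mul(x, y):
    return ((x[0] * y[0] - x[1] * y[1]) % p, (x[0] * y[1] + x[1] * y[0]) % p)


def power(x, e):
    result = (1, 0)
    for _ in range(e):
        result = mul(result, x)
    return result


def order(x):
    """Multiplicative order of a nonzero element of F_{p^2} (naive search)."""
    k, y = 1, x
    while y != (1, 0):
        y, k = mul(y, x), k + 1
    return k


nonzero = [(v, u) for v in range(p) for u in range(p) if (v, u) != (0, 0)]
alpha = next(x for x in nonzero if order(x) == p * p - 1)  # a primitive element
omega = power(alpha, (p + 1) // 2)
assert order(omega) == 2 * (p - 1)

# Interpolation points 1..p-1, ±s..±((p-3)/2)s and ((p-1)/2)s.
points = [(i, 0) for i in range(1, p)]
points += [(0, k % p) for k in range(1, (p - 3) // 2 + 1)]
points += [(0, (-k) % p) for k in range(1, (p - 3) // 2 + 1)]
points.append((0, (p - 1) // 2))
assert len(points) == 2 * p - 3 == len(set(points)) and (0, 0) not in points
print("alpha =", alpha, "omega =", omega)
```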
6 Conclusion

In this paper we derived new formulas for multiplication in $\mathbb{F}_{3^{6m}}$ which use only 15 multiplications in $\mathbb{F}_{3^m}$. Efficient multiplication of elements in $\mathbb{F}_{3^{6m}}$ is a central task in the computation of the Tate pairing on elliptic and hyperelliptic curves. Our method is based on the fast Fourier transform, slightly modified to adapt it to the finite fields we work with. Our software experiments for $m = 97$ show that this method is roughly 10% faster than the Montgomery method and roughly 15% faster than the Karatsuba method from the literature. We have also discussed the use of these ideas in conjunction with the general methods of Duursma-Lee for Tate pairing computations on elliptic and hyperelliptic curves.